{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4e2bd86c-ed80-44c0-9b52-7d470c82bf8c",
   "metadata": {},
   "source": [
    "## 安装必要的库\n",
    "> 请提前准备好mindspore和mindnlp的安装\n",
    "\n",
    "首先，安装所需的Python库：\n",
    "- `-q` 表述静默安装，不会出现很多的`Requirement already satisfied`等等"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "c64baeda-7159-4436-b8ea-60b5a0a7cfa3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mWARNING: You are using pip version 21.0.1; however, version 24.2 is available.\n",
      "You should consider upgrading via the '/home/ma-user/anaconda3/envs/MindSpore/bin/python3.9 -m pip install --upgrade pip' command.\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "!pip install -q huggingface_hub ipywidgets opencv-python"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca7abaa5-f492-4028-a7b2-d4d0fa0882e1",
   "metadata": {},
   "source": [
    "## 设置环境变量\n",
    "设置Hugging Face的国内镜像站点以加快下载速度："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "6ea48e86-02ac-4259-813e-b6a26a739f95",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ[\"HF_ENDPOINT\"] = \"https://hf-mirror.com\"       "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7bf038e6-af73-4245-9070-6caf552fecfd",
   "metadata": {},
   "source": [
    "## 下载加载视频\n",
    "- 使用`huggingface_hub`下载视频文件\n",
    "- 使用`ipywidgets`展示视频"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "00c9d670-4aa2-473d-9d3f-334c0a253bb0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "d493073379764beca81e0eebb2229939",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Video(value=b'\\x00\\x00\\x00 ftypisom\\x00\\x00\\x02\\x00isomiso2avc1mp41\\x00\\x00\\x00\\x08free...', width='500')"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from huggingface_hub import hf_hub_download\n",
    "from ipywidgets import Video\n",
    "\n",
    "file_path = hf_hub_download(\n",
    "    repo_id=\"nielsr/video-demo\", filename=\"eating_spaghetti.mp4\", repo_type=\"dataset\"\n",
    ")\n",
    "Video.from_file(file_path, width=500)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea2e6ad0-bf9c-41f4-974a-0dd4fc082269",
   "metadata": {},
   "source": [
    "## 定义采样函数\n",
    "`sample_frame_indices`通过在给定的帧范围内随机选择一段视频片段，并返回这段片段的帧索引。\n",
    "- `clip_len`: 需要采样的帧数。\n",
    "- `frame_sample_rate`: 帧采样率，决定了采样的密度。\n",
    "- `seg_len`: 视频的总帧数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "be6ef4ac-ed52-461c-a325-3b6c3d01146e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "np.random.seed(0)\n",
    "\n",
    "def sample_frame_indices(clip_len, frame_sample_rate, seg_len):\n",
    "    # 计算转换后的长度(在给定采样率下，实际需要的帧数长度)\n",
    "    converted_len = int(clip_len * frame_sample_rate)\n",
    "    # 选择结束索引\n",
    "    end_idx = np.random.randint(converted_len, seg_len)\n",
    "    # 计算开始索引\n",
    "    start_idx = end_idx - converted_len\n",
    "    #使用np.linspace在开始索引和结束索引之间生成clip_len个等间距的索引\n",
    "    indices = np.linspace(start_idx, end_idx, num=clip_len)\n",
    "    # 使用np.clip确保所有索引都在有效范围内，并将它们转换为整数类型\n",
    "    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)\n",
    "    return indices"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d55a7435-b7db-4291-b5fa-d16136e5f120",
   "metadata": {},
   "source": [
    "## 读取视频帧\n",
    "利用OpenCV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "b045662d-58a9-42e2-9af3-3202c9006089",
   "metadata": {},
   "outputs": [],
   "source": [
    "import cv2\n",
    "def read_video(file_path, indices):\n",
    "    # 打开视频文件\n",
    "    cap = cv2.VideoCapture(file_path)\n",
    "    # 初始化一个列表来存储帧\n",
    "    frames = []\n",
    "    # 遍历给定的帧索引\n",
    "    for idx in indices:\n",
    "        # 设置视频捕获对象到特定的帧位置\n",
    "        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)\n",
    "        # 读取该帧\n",
    "        ret, frame = cap.read()\n",
    "        # 将读取的帧添加到帧列表中，并且转换通道，因为opencv是BGR\n",
    "        if ret:\n",
    "            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # 转换为RGB\n",
    "            frames.append(frame)\n",
    "    # 释放视频捕获对象\n",
    "    cap.release()\n",
    "    # 将帧列表转换为NumPy数组并返回\n",
    "    return np.array(frames)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be45a1c6-467e-4861-bd41-a854663b34b6",
   "metadata": {},
   "source": [
    "## 采样和读取视频\n",
    "采样8帧并读取："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "3bb1023a-7359-4788-9cef-f32fb2feef95",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(8, 360, 640, 3)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cap = cv2.VideoCapture(file_path)\n",
    "seg_len = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))\n",
    "indices = sample_frame_indices(clip_len=8, frame_sample_rate=1, seg_len=seg_len)\n",
    "video = read_video(file_path, indices)\n",
    "video.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3923ae67-85a9-4a92-966c-d25c21ce5949",
   "metadata": {},
   "source": [
    "## 使用mindnlp库进行文本-视频匹配\n",
    "\n",
    "本次我们使用的是`X-CLIP`，一个将语言-图像基础模型适配于通用视频识别的新框架。\n",
    "- `X-CLIP`的整体结构与`CLIP`相似，采用两个编码器分别对文本和视频进行编码，然后通过比对这些特征来实现分类。\n",
    "- `X-CLIP`引入了一个轻量级、可“即插即用”的跨帧注意力模块，用于捕捉时间信息。\n",
    "- 此外，该模型使用视频提示（Prompt），可以生成具有区分能力的视觉提示，从而提升分类效果。因此，无需额外数据，`X-CLIP` 有效地利用了预训练的语言-图像模型，通过零样本或少样本学习实现视频识别。\n",
    "\n",
    "论文信息：\n",
    "> [**Expanding Language-Image Pretrained Models for General Video Recognition**](https://arxiv.org/abs/2208.02816)<br>\n",
    "> accepted by ECCV 2022 as an oral presentation<br>\n",
    "> Bolin Ni, [Houwen Peng](https://houwenpeng.com/), [Minghao Chen](https://silent-chen.github.io/), [Songyang Zhang](https://sy-zhang.github.io/), [Gaofeng Meng](https://people.ucas.ac.cn/~gfmeng), [Jianlong Fu](https://jianlong-fu.github.io/), [Shiming Xiang](https://people.ucas.ac.cn/~xiangshiming), [Haibin Ling](https://www3.cs.stonybrook.edu/~hling/)\n",
    "\n",
    "[[arxiv]](https://arxiv.org/abs/2208.02816)\n",
    "[[slides]](https://github.com/nbl97/X-CLIP_Model_Zoo/releases/download/v1.0/xclip-slides.pptx)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "dfae1baa-0a63-4552-ac36-585896ac1f97",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.\n",
      "  setattr(self, word, getattr(machar, word).flat[0])\n",
      "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.\n",
      "  return self._float_to_str(self.smallest_subnormal)\n",
      "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.\n",
      "  setattr(self, word, getattr(machar, word).flat[0])\n",
      "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.\n",
      "  return self._float_to_str(self.smallest_subnormal)\n",
      "Building prefix dict from the default dictionary ...\n",
      "Loading model from cache /tmp/jieba.cache\n",
      "Loading model cost 1.286 seconds.\n",
      "Prefix dict has been built successfully.\n",
      "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindnlp/transformers/tokenization_utils_base.py:1526: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted, and will be then set to `False` by default. \n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[MS_ALLOC_CONF]Runtime config:  enable_vmm:True  vmm_align_size:2MB\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[WARNING] CORE(90475,ffff864fc0b0,python):2024-10-24-20:59:31.816.322 [mindspore/core/utils/ms_context.cc:531] GetJitLevel] Set jit level to O2 for rank table startup method.\n"
     ]
    }
   ],
   "source": [
    "from mindnlp.transformers import XCLIPProcessor, XCLIPModel\n",
    "model_name = \"microsoft/xclip-base-patch32\"\n",
    "processor = XCLIPProcessor.from_pretrained(model_name)\n",
    "model = XCLIPModel.from_pretrained(model_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e820ebf1-67d6-4c54-accf-f490358b30e0",
   "metadata": {},
   "source": [
    "## 设置提示词和输入"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "0d0d94f3-94e7-4624-9860-53dece33b536",
   "metadata": {},
   "outputs": [],
   "source": [
    "inputs = processor(text=[\"playing sports\", \"eating spaghetti\", \"go shopping\"], videos=list(video), return_tensors=\"ms\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12e3b2cd-5f64-4dc8-bd50-b0e183322708",
   "metadata": {},
   "source": [
    "## 前向计算"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "8fec4c69-b393-4f7a-9174-da1baf49f8d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "outputs = model(**inputs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "c358ec73-672f-412d-ac3a-bf749d52ff8a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Tensor(shape=[1, 3], dtype=Float32, value=\n",
       "[[ 1.26835327e+01,  2.11186066e+01,  1.28016310e+01]])"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "logits_per_video = outputs.logits_per_video  # 这是视频-文本相似度得分\n",
    "logits_per_video"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "a1b9e3fa-deac-4004-9961-813e51dd6da0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Tensor(shape=[1, 3], dtype=Float32, value=\n",
       "[[ 2.17016917e-04,  9.99538779e-01,  2.44221010e-04]])"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 我们可以使用 softmax 来获取标签概率\n",
    "from mindspore import ops\n",
    "ops.softmax(logits_per_video,1)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "MindSpore",
   "language": "python",
   "name": "mindspore"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
